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Abstract: This chapter surveys advances in the field of Bayesian compu- 
tation over the past twenty years, with missing data. It also contains some 
novel computational entries on the double-exponential model that may be 
*y\ of interest per se. 
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1. Introduction 



o 

< 

[^ It has long been a bane of the Bayesian approach that the sohitions it proposed 

were intellectually attractive but inapplicable in practice. While some numerical 
analysis solutions were suggested (see, e.g. Smith, 1984), they were not in par 
with the challenges raised by handling non-standard probability densities, espe- 
cially in high dimensional problems. This stumbling block in the development of 
the Bayesian perspective became clear when new simulations methods became 
available in the early 1990 's and the number of publications involving Bayesian 
, ^, methods rised significantly (no test available!). While those methods were on 

principle open to any type of inference, they primarily benefited the Bayesian 
T-H paradigm as they were "ideally" suited to the core object of Bayesian inference, 

^ namely a mostly intractable posterior distribution. 

^V This chapter will not cover the historical developments of computational 

^^ methods (see, e.g., Robert and CascUa, 2011) nor the technical implementation 

psj details of simulation techniques (see, e.g., Robert and Casella, 2004, Robert and 

l' Casella, 2009), but instead focus on examples of application of those methods 

f~^ to Bayesian computational challenges. Given the length of the chapter, it is to 

C^ be understood as a sequence of illustrations of the main computational tools, 

^~H rather than a comprehensive introduction, which is to be found in the books 

>^ mentioned above and below. 



;_( 2. Some computational challenges 



The starting point of a Bayesian analysis being the posterior distribution, let 
us recall that it is defined by the product 

Tr{d\x)(X7T{e)f{x\e) 
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/Bayesian Computational Tools 2 

where 9 denotes the parameter and x the data. (The symbol oc means that the 
functions on both sides of the symbol are proportional as functions of 9, the 
missing constant being a function of x, m{x).) The structures of both 9 and x 
can vary in complexity and dimension, although we will not discuss the non- 
parametric case when 9 is infinite dimensional, referring the reader to Holmes 
et al. (2002) for an introduction. The prior distribution is most often available 
in closed form, being chosen by the experimenter, while the likelihood function 
f{x\9) may be too involved to be computed even for a given pair {x,9). In special 
cases where f{x\9) allows for a demarginalisation representation 

f{x\9) = J f{x,z\9)dz, 

where g{x, z\9) is a (manageable) probability density, we will call z the missing 
data. However, the existence of such a representation does not necessarily implies 
it is of any use in computations. (We will encounter both cases in Sections 4 
and 5.) 

Since the posterior distribution is defined by 

7ri9\x) = TT{9)f{x\9)/ 1^ 7r{9)f{x\9) d9 

a first difficulty occurs because of the normalisation constant: the denominator 
is very rarely available in closed form. This is an issue only to the extent that 
the posterior density is defined up to a constant. In cases where the constant 
does not matter, inference can be easily conducted without the constant. Cases 
when the constant matters include testing and model choice, since the marginal 
likelihood 



m{x) = / 7r(6l)/(2;|6l)de' 
Jo 

is central to the Bayesian procedures addressing this inferential problem (see, 
e.g., Robert, 2001, Chapter 5). This means that for almost any inference problem- 
barring the very special case of conjugate priors — there is a computational issue, 
not the most promising feature for promoting an inferential method. This aspect 
has obviously been addressed by the community, see for instance Chen ct al. 
(2000) that is entirely dedicated to the problem of approximating normalising 
constants or ratios of normalising constants, but I regret it is not acknowledged 
more clearly as one of the major computational challenges of Bayesian statistics 
(see also Marin and Robert, 2011). 

Example 1 As a benchmark, consider the case (Marin ct al., 2011a) when a 
sample {xi, . . . , Xn) can be issued either from a normal A/'(/i, 1) distribution or 
from a double-exponential C{fi, l/v2) distribution with density 

fo{x\fJ-) = ^ exp{-V2\x - fi\} . 

Then, as it happens, the Bayes factor is available in closed form, since, under 
a normal /z ~ A/'(0,cr^) prior, the marginal likelihood for the normal model is 
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given by 

/n 
(27r)-"/2 Y^ exp{-(x, - fi)'^/2} exp{-^V2cr2} dn/V2^cr 

i—1 

n 

= (27r)-"/2exp{-^(x,-x„)V2} 

1=1 

X / exp[— {(n + a^'^)ii^ — 2ri/ix„ + n(x„)^}/2] d/i/v27rcr 

n 

= (27r)-"/2 exp{- ^(a;, - x„) 2/2} 

1=1 

X exp{-ncr~2(S„)2/2(n + (T^2)}/crVn + o— ^ 
and, for the double-exponential model, by (assuming the sample is sorted) 

/n 
2-"/2 jj cxp{-\/2|a;i - /i|} exp{-/i2/2tj2} Aii.l\ptKa 

9— n/2 " rxi+i 



y^ / TJ f,V2xj-V2^ TT g-y2x,+y2A'g-A'V2<T" (J^ 

i=0 "'^- i = l i = J+l 

O—n/2 " /-a^^i+i 

^ y^ / g\/2E- = i2;j-\/2 2:"=.+ i2^j+\/2(n-2j)Mg-MV2'j'(J„ 

V2^a^y.. 

n 
2-«/2 y^ e^^J = i ^^-^^"=.+ 1 ^j+2(n-2j)VV2 

{M-/2(„-2»)a^}V2a^ dAi/\/2^a 



4=0 

X / e" 



4=0 



X [$({0:4+1 - \/2(n - 2i)cr2}/cr) - $({a;, - V2{n - 2i)(7'^}/a) 

with obvious conventions when i = {xq = —00) and i = n (a;„+i = +00). To 
illustrate the consistency of the Bayes factor in this setting, Figure 1 represents 
the distributions of the Bayes factors associated with 100 normal and 100 double- 
exponential samples of size 200. In both cases, the log-Bayes factor distribution 
concentrates on the proper side of zero, meaning that it eventually discriminates 
correctly between the two distributions. -4 
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Fig 1. Repartition of the values of the log- 
Bay es factors associated with 100 nor- 
mal (orange) and 100 double- exponential 
samples of size 200 (blue), estimated by 
the default R density estimator. 



Another recurrent difRculty with pos- 
terior distributions is the derivation 
of credible sets — the Bayesian version 
of confidence sets (see, e.g., Robert, 
2001) — since they are usually defined 
as highest posterior density regions: 

C^{x) - {B- Ti{e\x) > K^{X)} , 

where the bound k^ is determined by 
the credibility of the set 

¥{6 e Ca{x)\x) = a . 

While the normalisation constant is 
irrelevant in this problem, determin- 
ing the collection of parameter values 
such that 7r(6')/(a;|6') > Ka{x) and cal- 
ibrating the lower bound Ka{x) on the 
product 7r{9)f{x\9) to achieve proper 



coverage are non-trivial problems that require advanced simulation methods. 
Once again, the issue is somehow overlooked in the literature. 

While one of the major appeals of Bayesian inference is that it is not reduced 
to an estimation technique — but on the opposite offers a whole range of infer- 
ential tools to analyse the data against the proposed model — , the computation 
of Bayesian estimates is certainly one of the better addressed computational 
issues. This is especially true for posterior moments like the posterior mean 
E'^[6'|a;] since they are directly represented as ratios of integrals 



E'^[6'|x] = 



f^0n{0)f{x\0)de 
J^7r{9)f{x\9)d0 



The computational problem may however get involved for several reasons, in- 
cluding for instance 

- the space O is not Euclidean and the problem imposes shape constraints 
(as in some time series models); 

- the dimension of O is large (as in non-parametrics); 

- the estimator is the solution to a fixed point problem (as in the credible 
set definition); 

- simulating from Tr{9\x) is delicate or even impossible; 

the latter case being in general the most challenging and thus the most studied, 
as the following sections will show. 

3. Monte Carlo methods 

Monte Carlo methods have been introduced by physicists in Los Alamos, namely 
Ulam, von Neumann, Metropolis, and their collaborators in the 1940's (see 
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Robert and Casella, 2011). The idea behind Monte Carlo is a straightforward ap- 
plication of the law of large numbers, namely that, when xi,X2, ■ ■ ■ are i.i.d. from 
the distribution /, the empirical average 



^Em^*) 



converges (almost surely) to Ef[h{X)] when T goes to +00. While this per- 
spective sounds too simple to apply to complex problems — either because the 
simulation from / itself is intractable or because the variance of the empirical 
average is too large to be manageable — , there exist more advanced exploitations 
of this result that lead to efficient simulation solutions. 



Example 1 (bis) Consider computing the Bayes factor 
Soi(a;i, . . . , a;„) = mo(xi, . . . , x„)/toi(xi, . 



) •^n) 



by simulating a sample (/^i, . . . , fix) from the prior distribution, Af{0, o"^). The 
approximation to the Bayes factor is then provided by 



^i=En/o(^'i^*)/En/i(^»i^*) 



t=i i=i 



t=l i=l 



given that in this special case the same prior and the sam,e Monte Carlo samples 
can be used. Figure 2 shows the convergence of *8oi over T = 10^ iterations, 
along with the true value. The method exhibits convergence. M 




Fig 2. Convergence of a 

Monte Carlo approximation 
of Q3oi(a;i, ■ . ■ ,Xn) for a normal 
sample of size n — 19, along with 
the true value (dash line). 



Although importance sampling is at the source 
of the particle method, whiles the techniques 
of control variates and antithetic and parti- 
tioned samplings all but fail to be mentioned 
in standard textbooks (see Robert and Casella, 
2011), let me briefly introduce the notion of 
bridge sampling (Mcng and Wong, 1996) as 
it applies to the approximation of Bayes fac- 
tors 

*Bi2(a;)= / /i(a;|0i)^i(0i)d0i 



/2(a;|02)7r2(02)d02 



02 



(and to other ratios of integrals). This method 
handles the approximation of ratios of in- 
tegrals over identical spaces (a severe con- 
straint), by reweighting two samples from 
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both posteriors, through a weU-behaved type 
of harmonic average. 

More specifically, when Oi =02, possibly after a reparameterisation of both 
models to endow 9 with the same meaning, we have 

"BMx) = / fi{x\e)7ri{e)a{e)TT2{e\x)de / f h{x\e)n2{9)a{0)ni{9\x)d9 



'^2-^E;:i/i(^i^2,,>i(g2,,)a(g2.j-) 

ni-'E;Uf2ix\9^^,)n2{euj)a{9^,,) 

where 9i_i, . . . ,9i_ni and ^2,1, ■ • • , ^2,ni are two independent samples coming 
from the posterior distributions tti{9\x) and tt2{9\x), respectively. (This identity 
holds for any function a guaranteeing the integrability of the products.) How- 
ever, there exists a quasi-optimal solution, as provided by Gelman and Meng 
(1998): 

a* {9) oc -— n"^ ■ 

niTri[9\x) +n2Tr2[d\x) 

While this optimum cannot be used — given that it relies on the normalising 
constants of both 7ri(-|x) and 7r2(-|x) — , a practical implication of the result 
resorts to an iterative construction of a*. We gave in Chopin and Robert (2010) 
an alternative representation of the bridge factor that bypasses this difficulty (if 
difficulty there is!). 

Example 1 (ter) If we want to apply the bridge sampling solution to the nor- 
mal versus double-exponential example, we need to simulate from the posterior 
distributions in both models. The normal posterior distribution on /.i is a normal 
N[nxn/[n + a~^), l/(n -|- cr~^)) distribution, while the double-exponential dis- 
tribution can be derived as a mixture of (n-|- 1) truncated normal distributions, 
following the same track as with the computation of the marginal distribution 
above. The sum obtained in the above expression of mo(xi, . . . ,a;„) suggests 
interpreting Tro{fj,\xi, . . . , Xn) as (once again assuming x sorted) 



4=0 



0JiJ\f^{V2{n - 2i)a^, a^, x^, x^+i) 



where Af"^ {5,T'^,a, /3) denotes a truncated normal distribution, that is, the nor- 
mal Af{S, T^) distribution restricted to the interval (a, /3), and where the weights 
uji are proportional to those summed in mo{xi, . . . ,Xn) (see Example 1 (bis)). 
The outcome of one such simulation is shown in Figure 3 along with the tar- 
get density: as seen there, since the true posterior can be plotted against the 
histogram, the fit is quite acceptable. If we start with an arbitrary estimation 
of *Boi like faoi = 1, successive iterations produce the following values for the 
estimation: 11.13, 10.82, 10.82, based on 10"* samples from each posterior dis- 
tribution (to compare with an exact ratio equal to 10.3716 and a Monte Carlo 
approximation of 10.55). -^ 
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While this bridge solution produces valu- 
able approximations when both param- 
eters 9i and 62 are within the same " 1 _ ' 
parameter space and have the same ' " " > 
or similar absolute meanings (e.g., 9 ^ _ / •• 
is equal to Ee[X] in both models), it /' \ 
does not readily apply to settings with j ' \ 
variable dimension parameters. In such ° / \ 
cases, separate approximations of the /' » 
evidences, i.e. of the numerator and ~ - / \ 
denominator in *8i2 are requested, with /' \ 
the exception of reversible jump Monte „ J 

Carlo techniques (Green, 1995) pre- _„^;^ :^^ ^^ ^^^ ^,„ 

sented in the following section. We re- n 

fer the reader to Marin and Robert 

(2011) for a model based solution us- ""f 3. mstogram of 10 simulations from 
. . . the posterior distribution associated with 

mg harmonic means and an importance ^ double- exponential sample of size 150, 
function restricted to an HPD region along with the curve of the posterior 
(see also Robert and Wraith, 2009 and (dashed lines). 
Weinberg, 2012). We however insist on 

(and bemoan) the lack of generic solution for the approximation of Bayes factors, 
despite those being the workhorse of Bayesian model selection and hypothesis 
testing. 

4. MCMC methodology 

The above Monte Carlo techniques impose (or seem to impose) constraints on 
the posterior distributions that can be approximated by simulation. Indeed, 
direct simulation from the posterior is not always feasible in a (time- wise) man- 
ageable form, while importance sampling requires a sufficient precision in the 
approximation of 7r(-|2:) for the resulting estimators to enjoy a moderate or at 
least finite variance. Markov chain Monte Carlo (MCMC) methods were intro- 
duced (also in Los Alamos) with the purpose of bypassing this requirement of 
an a priori knowledge on the target distribution. On principle, they apply to 
any setting where 7r(-|x) is known up to a normalising constant (or worse, as a 
marginal of a distribution on an augmented space) . 

As described in another chapter of this volume (Craiu and Rosenthal, 2013), 
MCMC methods rely on ergodic theorems, i.e. the facts that, for positive re- 
current Markov chains, (a) the limiting distribution of the chain is always the 
stationary distribution and (b) the law of large numbers applies. The fascinat- 
ing feature of those algorithms is that it is straightforward to build a Markov 
chain (kernel) with a stationary distribution equal to the posterior distribution, 
even when the latter is only know up to a normalising constant. Obviously, 
there are caveats with this rosy tale: complex posteriors remain harder to ap- 
proximate than essentially Gaussian posteriors, convergence (ergodicity) may 
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require in-human time ranges or simply not agree with the hmited precision of 
computers. 

For completeness' sake, we recall here the format of a random walk Metropolis- 
Hastings (RWMH) algorithm (Hastings, 1970) 

Algorithm 1 RWMH 

for t = 1 to T do 

Generate ^ ~ vd? — ^t—i\) 

Take 9t = ^ with probability a = min{l, /o(x|^)7ro(5)//o{x|6»t_i)7ro(6't_i) 

Take 6t = 9t—i otherwise. 
end for 



Example 1 (quater) If we consider once 
again the posterior distribution on /i associ- 
ated with a Laplace sample, even though the 
exact simulation from this distribution was 
implemented in Example 1 (ter), an MCMC 
implementation is readily available. Using a 
RWMH algorithm, with a normal distribu- 
tion centred at /it_i and with scale cr, the 
implementation of the method is straightfor- 
ward. 

As shown on Figure 4, the algorithm is 
less efficient than an iid sampler, with an ac- 
ceptance rate of only 6%. However, one must 
also realise that devising the code behind the 
algorithm only took five lines and a few min- 
utes, compared with the most elaborate con- 
struction behind the iid simulation! -^ 



Fig 4. Values of the Markov chain 
(Ht) (sienna) and of iid simula- 
tions (wheat) for 10^ iterations 
and a double exponential sample 
of size n = 150, when using a 
RWMH algorithm with scale equal 
to 1. 



4.1. Gibbs sampling 

A special class of MCMC methods seems to have been especially designed for 
Bayesian hierarchical modelling (even though they do apply in a much wider 
generality). Those go under the denomination of Gibbs samplers, unfortunately 
named after Gibbs for the mundane reason that one of their initial implementa- 
tions was for the simulation of Gibbs random fields (in image analysis, Geman 
and Geman, 1984). Indeed, Gibbs sampling addresses the case of (often) high- 
dimensional problems found in hierarchical models where each parameter (or 
group of parameters) is endowed with a manageable full conditional posterior 
distribution. (While the joint posterior is not manageable.) The principle of the 
Gibbs sampler is then to proceed by local simulations from those full condi- 
tionals in a rather arbitrary order, producing a Markov chain whose stationary 
distribution is the joint posterior distribution. 
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Let us recall that a Bayesian hierarchical model is build around a hierarchy 
of probabilistic dependences, each level depending only on the neighbourhood 
levels (except for global parameters that may impact all levels). For instance, 

X ^ /(x|0i) , e,\9-2 ^ 7r,{0,\d2) , 02 ^ ^2(^2) 

induces a simple hierarchical model in that x only depends on 9i while 62 only 
depends on 9i — i.e., x is independent of 62 given 61. 
Examples of such structures abound: 

Example 2 A typical instance is made of random effect models as in the follow- 
ing instance (inspired from Brcslow and Clayton, 1993) of Poisson observations 
{i = l,...,n,j^l,...,Nj) 

Xy ~ V{eJq){^J.i + Sij}) 



ey- 


-AA(0,p') 


^^i -- 


= log rrii + z^/3 


P^ 


^NMaHa) 


a\p'^ 


^ 7r(a;) = \/uj 



where i denotes a group or district label, j the replication index, z^ a vector 
of covariates, mi a population size. In this model, given the data x = {xij, i = 
1, . . . , n, J = 1, . . . , Nj}, a Gibbs sampler generates from the joint distribution 
of Cij, /3, cr^, and p^ by using the conditionals 

Cij ^^ 7ryCij\X{j ^ Pi^ p j 

/3~^(/3|x,e,a2) 
p'-<p'\e) 

which are more or less manageable (as they may require individual Metropolis- 
Hasting implementations where the Poisson distribution is replaced with its 
normal approximation in the proposal). A 

Example 3 A growth measurement model was applied by Potthoff and Roy 
(1964) to dental measurements of 11 girls and 16 boys, as a mixed-effect model. 
(The dataset is available in R as orthodont in package nime.) Compared with 
the random effect models, mixed-effect models include additional random-effect 
terms and are more appropriate for representing clustered, and therefore de- 
pendent, data arising in, e.g., hierarchical, paired, or longitudinal data.) For 
i — 1, . . . , n children and j — 1, . . . , r observations on each child, growth is 
expressed as 
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10 




Vij = tti + I3h,tj + al^Cij , 

where h = {hi, ... , hn) is a sex 
factor with hi g {1,2} (1 corre- 
sponds to female and 2 to male) 
and t = (ti , . . . , ij. is the vector 
of ages. The random effects in 
this growth model are the ai's, 
which are independent N {^nu- , r^ 
variables. The priors on the cor- pj^ 5 Directed acyclic graph associated with the 
responding parameters are cho- Bayesian modelling of the growth data of Pot- 
sen to be conjugate: th".ff "-^d Roy (1964). 



/3i,/32^A/'i(0,(t|) 



^2 2 ji 

0^1 , 0-2 ; ^ 



XQ{a,a), a\^XQ{a,a), /ii, /i2 -- M (O, cr^) 



Figure 5 summarises the Bayesian model through a DAG (directed acyclic graph, 
see (Lauritzcn, 1996)). 

Thanks to this conjugacy, the full conditionals are available as standard dis- 
tributions (fc = 1,2): 



h-M 



Er i2 — 2 I —2 






ig \ a + "fc V2, a + Y^ lh,=k ^ iVrj 



i=l 






/?ii, 



V nfcr-2 -h CTu 



a. 



f/2 



'Ig[a + ^^/2,a + Y^{a,-^ihf/2\ , 



=1 



where Uk is the number of children with sex fc, and [i = 1, . . . , n) 



'N 



-,{--' + -.t)" 



It is therefore straightforward to run the associated Gibbs sampler. Figures 6 and 
7 show the raw output of some parameter series, based on 120, 000 iterations. 
For instance, those figures show that /3i and ^2 are possibly equal, as their likely 
ranges overlap. This does not seem to hold for /ii and ^2- •^ 

One of the obvious applications of the Gibbs sampler is found in graphical 
models — an application that occurred in the early days of MCMC — since those 
models are defined by and understood via conditional distributions rather than 
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Trace of beta[1] 



Density of beta[1] 




20000 40000 60000 80000 100000 120000 

Trace of beta[2] 



8 10 12 

Density of beta[2] 
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Trace of mu[1] 





20000 40000 60000 80000 100000 120000 

Trace of mu[2] 



120 140 160 180 200 

Density of mu[2] 




20000 40000 60000 80000 100000 120000 



180 200 



Fig 6. Evolution of the Gihhs Markov chains for some parameters of the growth mixed- 
effect model of Pothoff and Roy (1964) (right) and density estimate of the correspond- 
ing posterior distribution (right), based on 120,000 iterations. 
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Trace of sigma2[1] 



Density of sigma2[1] 





20000 40000 60000 80000 100000 120000 100 200 300 400 500 600 700 

Trace of sigrna2[2] Density of sigma2[2] 




20000 40000 60000 80000 100000 120000 

Trace of tau2 



50 100 150 

Density of tau2 




20000 40000 60000 80000 100000 120000 




500 1000 



Fig 7. Same legend as Figure 6. 
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through an unmanageable joint distribution. As detailed in Lauritzen (1996), 
undirected probabilistic graphs are Markov with respect to the graph structure, 
which means that variables indexed by a given node 77 of the graph only depend 
on variables indexed by nodes connected to 77. For instance, if the vector indexed 
by the graph is Gaussian, X ^ A/'(^, I), the non-zero terms of L~^ correspond to 
the edges of the graph. Applications of this modelling abound, as for instance in 
experts systems (Spicgelhalter et al., 1993). Note that hierarchical Bayes models 
can be naturally associated with dependence graphs and thus fall within this 
category as well. 

4.2. Reversible-jump MCMC 

Although the principles of the MCMC methodology are rather straightforward 
to understand and to implement, resorting for instance to down-the-shelf tech- 
niques like RWMH algorithms, a more challenging setting occurs with the case 
of variable dimensional problems. These problems typically occur in a Bayesian 
model choice situation, where several (or an infinity of) models are considered 
at once. The resulting parameter space is a millefeuille collection of sets, with 
most likely different dimensions, and moving around this space or across those 
layers is almost inevitably a computational issue. Indeed, the only case open 
to direct computation is the one when the posterior probabilities of the models 
under comparison can be evaluated, resulting in a two-stage implementation, 
the model being chosen first and the parameters of this model being simulated 
"as usual" . However, as seen above, computing posterior probabilities of models 
is rarely a straightforward case. In other settings, moving around the collection 
of models and within the corresponding parameter spaces must occur simulta- 
neously, especially when the number of models is large or infinite. 

Defining a Markov chain kernel that explores the multi-layered space is chal- 
lenging because of the difficulty of defining a reference measure on this complex 
space. However, Green (1995) came up with a solution that is rather simplex 
to express (if not necessarily to implement). The idea behind Green's (1995) 
reversible jump solution is to take advantage of the Markovian nature of the 
algorithm: since all that matters in a Markov chain is the most recent value of 
the chain, exploration of a multi-layered space, 

only involves a pair of sets O, at each step, ©^ and Q^: say. Therefore, the 
mathematical difficulty reduces to create a connection between both spaces, dif- 
ficulty that is solved by Green's (1995) via the introduction of auxiliary variables 
Al and At: in order for (0^, AJ and (0^, A^) to be in one-to-one correspondence, 
i.e. {9i,Xi) — ^(^TjAx). Arbitrary distributions on Ai and on A^ then come to 
complement the target distributions tt{l,9^\x) and 7r(T, 6'T-|a;). The algorithm is 
call reversible because the symmetric move from (0^, AJ to (0-r, A-r) must follow 
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{9t:,Xt:) = ^^^(^LjA^). In other words, moves one way determine moves the 
other way. A schematic representation is as follows: 

Algorithm 2 RJMCM 

for t = 1 to T do 

Given current state (i,8i), 

Generate index T from the prior probabilities 7r(T). 

Generate Ai from the auxiliary distribution n^{X^) 

Compute (6It, At) = ^-^{e„\,) 

Accept to switch to {i,6i) with probability 



7r(T,6'T|x)7rT(AT) 



d*(6»T,AT) 



dCe^.A^-) 



Else reproduce (i, 9t) 
end for 



The important feature in the above acceptance probability is the Jacobian 
term d^(0T, AT)/d(0x, Ax) which corresponds to the change of density in the 
transformation. It is also a source of potential mistakes in the implementation 
of the algorithm. 

The simplest version of RJMCM is when 9,^ — (6'i,,Al), i.e. when the move 
from one parameter space to the next involves adding or removing one param- 
eter, as for instance in estimating a mixture with an unknown number of com- 
ponents (Richardson and Green, 1997) or a MA(p) time series with p unknown. 
It can also be used with p known, as illustrated below. 

Example 4 An MA{p) time series model — where MA stands for 'moving average'- 
is defined by the equations 

p 

i=l 

where the e^'s are iid A/'(0, cr^). While this model can be processed without 
RJMCMC, we present here a resolution explained in Marin and Robert (2007) 
that does not distinguish between the cases when p is known and when p is 
unknown. 

The associated "lag polynomial" 'P(B) = I -I- ^^^-^ fJjB' provides a formal 
representation of the series as Xt = V(B)et, with let — ft, Bet — ct-i, ■•• As a 
polynomial it also factorises through its roots Ai as 



V{B) = f[(I - A,B) 



i=\ 

While the number of roots is always p, the number of (non-conjugate) complex 
roots varies between (meaning no complex root) and [P/2J . This representation 
of the model thus induces a variable dimension structure in that the parameter 
space is then the product (—1,1)'' x i3(0, 1)''"'^/^, where i?(0, 1) denotes the 
complex unit ball and r is the number of real valued roots A^i?. The prior 
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distributions on the real and (non-conjugate) complex roots are the uniform 
distributions on (—1, 1) and -6(0, 1), respectively. In other words, 

-w=uJtt n ^v.i<i n^wi)(^o, (1) 

'- ' -' A,e(-l,l) A.^R 

Moving around this space using RJMCMC is rather straightforward: either the 
number of real roots does not change in which case any regular MCMC step 
is acceptable or the number of real roots moves up or down by a factor of 2, 
new roots being generated from the prior distribution, in which case the above 
RJMCMC acceptance ratio reduces to a likelihood ratio. An extra difficulty with 
the AIA{p) setup is that the likelihood is not available in closed form unless 
the past innovations eg, e_i, . . . , ei_p are available. As explained in Marin and 
Robert (2007), they need to be simulated in a Gibbs step, that is, conditional 
upon the other parameters with density proportional to 

Jl exp {-e?/2o-2} J]^ exp <! - a;t - ^t + ^ 'djtt^ 

t=o t=i I y j=i 

where ep = eo,. . ., ei_p = ei_p and (t > 0) 

p 

This recursive definition of the likelihood is rather costly since it involves com- 
puting the Ef's for each new value of the past innovations, hence T sums of p 
terms. Nonetheless, the complexity 0(Tp) of this representation is much more 
manageable than the normal exact representation mentioned above. < 

As mentioned above, the difficulty with RJMCM is in moving from the general 
principle (which indeed allows for a generic exploration of varying dimension 
spaces) to the practical implementation: when faced with a wide range of models, 
one needs to determine which models to pair together — they must be similar 
enough — and how to pair them — so that the jumps are efficient enough. This 
requires the calibration of a large number of proposals, whose efSciency is usually 
much lower than in single-model implementations. Whenever the number of 
models is limited, my personal experience is that it is more efficient to run 
separate (and parallel) MCMC algorithms on all models and to determine the 
corresponding posterior probabilities of those models by a separate evaluation, 
like Chib's (1995). (Indeed, a byproduct of the RJMCMC algorithm is to provide 
an evaluation of the posterior probabilities of the models under comparison via 
the frequencies of accepted moves to such models.) See, e.g., Lee et al. (2009) 
for an illustration in the setting of mixtures of distributions. 
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5. Approximate Bayesian computation methods 

This section covers some aspects of a specific computational metliod called Ap- 
proximate Bayesian computation (ABC in short), which stemmed from acute 
computational problems in statistical population genetics and rised in impor- 
tance over the past decade. The section should be more methodological than 
the previous sections as the method is not covered in this volume, as far as I 
can assess. In addition, this is a special computational method in that it has 
been specifically developed for challenging Bayesian computational problems 
(and that it carries the Bayesian label within its name!). Although the reader is 
referred to, e.g., Toni et al. (2009) and Beaumont (2010) for a deeper review on 
this method, I will cover here different accelerating techniques and the numerous 
calibration issues of selecting both the tolerance and the summary statistics. 

Approximate Bayesian computation (ABC) techniques appeared at the end 
of the 20th Century in population genetics (Tavare et al., 1997; Pritchard et al., 
1999), where scientists were faced with intractable likelihoods that MCMC 
methods were simply unable to handle with the slightest amount of success. 
Some of those scientists developed simulation tools overcoming the jamming 
block of computing the likelihood function that turned into a much more gen- 
eral form of approximation technique, exhibiting fundamental links with econo- 
metric methods such as indirect inference. Although some part of the statistical 
community was initially reluctant to welcome them, trusting instead massively 
parallelised MCMC approaches, ABC techniques are now starting to be part of 
the statistical toolbox and to be accepted as an inference method per se, rather 
than being a poor man's alternative to more mainstream techniques. While de- 
tails about the method are provided in recent surveys (Beaumont, 2008, 2010; 
Marin et al., 2011b), we expose in algorithmic terms the basics of the ABC 
algorithm: 

Algorithm 3 ABC 

for t = 1 to T do 
repeat 

Generate 8* from the prior 7r(-). 
Generate x* from the model f{-\6*)- 
Compute the distance p(S'{x''), 5(x*)). 
Accept e* if p{S{x.O), 5(x*)) < e. 
until acceptance 
end for 

The idea at the core of the ABC method is to replace an acceptance based on 
the unavailable likelihood with one evaluating the pertinence of the parameter 
from the proximity between the data and a simulated pseudo-data. 

Example 4 (bis) While the MA(p) is manageable by other approaches — since 
the missing data structure is of a moderate complexity — , it provides an illustra- 
tion of a model where the likelihood function is not available in closed form and 
where the data can be simulated in a few lines of code given the parameter. Using 
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the p first autocorrelations as summary statistics S{-), we can then simulate pa- 
rameters from the prior distribution and corresponding series x* = {xl, . . . ,x^) 
and only keep the parameter values associated with the smallest S'(x*)'s. 

As shown in Figure 8, reproduced from Marin et al. (2011b), there is a differ- 
ence between the genuine posterior distribution and the ABC approximation, 
whatever the value of e is. This comparison also shows that the approximation 
stabilises quite rapidly as e decreases to zero, in agreement with the general 
argument that the tolerance should not be too close to zero for a given sample 
size (Fearnhcad and Prangle, 2012). -4 

ABC suffers from an "informa- 
tion paradox" , namely that it 
rarely pays to increase the di- 
mension of the summary statis- 
tic S{-) in the hope to bring the 
ABC inference closer to a "per- 
fect" Bayesian inference based 
on the whole data. 

Indeed, the acceptance step 
in the above ABC algorithm is 
almost always based on a dis- 
tance between summary statis- 
tics, £i(S'(x°), 5(x*)), rather than 
between the raw data and its 
simulated counterpart. For in- 
stance, in Example 4 (bis), us- 
ing the raw time series instead 
of the vector of empirical au- 
tocorrelations would have been 
strongly detrimental as the dis- 
tance between two simulated se- 
ries grows with the time hori- 
zon and brings very little infor- 
mation about the value of the 
underlying parameter. In other 

words, it forces us to use a much larger tolerance e in the algorithm. The paradox 
is easily explained: 

- the (initial) intuition upon which ABC is built considers the limiting case 
e ~ and the fact that 7rABc('|x*') is an approximation to 7r(-|x°), as 
opposed to the true setting being that 7rABc('|'S'(y)) is an approximation 
to 7r(-|S'(x'^)) and that it incorporates a Monte Carlo error as well; 

- for a given computational effort, the tolerance e is necessarily positive — if 
only to produce a positive acceptance rate — and deeper studies show that 
it behaves like a non-parametric bandwidth parameter, hence increasing 
with the dimension of S while (slowly) decreasing with the sample size. 





Fig 8. Variatton of the estimated distributions 
of ABC samples using different quantiles on the 
simulated distances for e (^10% m blue, 1% in red, 
and 0.1% in yellow) when compared with the true 
marginal densities. The observed dataset is sim- 
ulated from an MA{2) model with n = 100 ob- 
servations and parameter § — (0.6, 0.2) (Source: 
Marin et al., 2011b). 



imsart-generic ver. 2011/11/15 file: BCP.tex date: April 9, 2013 



/Bayesian Computational Tools 18 

Therefore, when the dimension of the raw data is large (as for instance in the 
time series setting of Example 4 bis), it is definitely not recommended to use 
a distance between the raw data x" and the raw pseudo-data x*: the curse of 
dimension operates in nonparametric statistics and clearly impacts the approx- 
imation of 7r(-|x*') as to make it impossible even for moderate dimensions. 

In connection with the above, it must be stressed that, in almost any imple- 
mentation, the ABC algorithm is not correct for at least two reasons: the data 
x° is replaced with a roughened version {x*;d{S{x^),S{x*)) < e} and the use 
of a non-sufficient summary statistic S{- ■ ■). In addition, as in regular Monte 
Carlo approximations, a given computational effort produces a corresponding 
Monte Carlo error. 



5.1. Selecting summaries 

The choice of the summary statistic S'(-) is paramount in any implementation 
of the ABC methodology if one does not want to end up with simulations from 
the prior distribution resulting from too large a tolerance! On the opposite, an 
efficient construction of 5'(- • • ) may result in a very efficient approximation for 
a given computational effort. 

The literature on ABC abounds with more or less recommendable solutions 
to achieve a proper selection of the summary statistic. Early studies were either 
experimental (McKinley et al., 2009) or borrowing from external perspectives. 
For instance, Blum and Frangois (2010) argue in favour of using neural nets in 
their non-parametric modelling for the very reason that neural nets eliminate 
irrelevant components of the summary statistic. However, the black box features 
of neural nets also mean that the selection of the summary statistic is implicit. 
Another illustration of the use of external assessments is the experiment ran by 
Sedki and Pudlo (2012) in mixing local regression (Beaumont et al., 2002) local 
regression tools with the BIC criterion. 

In my opinion, the most accomplished (if not ultimate) development in the 
ABC literature about the selection of the summary statistic is currently found 
in Fearnhcad and Prangle (2012). Those authors study the use of a summary 
statistic S from a decisional perspective, obtaining in addition a determination 
of the optimal bandwidth (or tolerance) h from non-parametric evaluations of 
the error. In particular, the authors argue that the optimal summary statistic 
is E[0|x°] (when estimating the parameter of interest 9). For this, they notice 
that the errors resulting from an ABC modelling are of three types: 

- one due to the approximation of 7r(0|x°) by 7r(6'|S'(x°)), 

- one due to the approximation of Tr{9\S{x^)) by 

j7:is)K[{s-Six")}/h]7rie\s)ds 

7rABc(e|^(x )) ^ J^^,)K[{s-S{xO)}/h]ds 

where K{-) is the kernel function used in the acceptance step — equal to 
the indicator I(-i.i) in the original algorithm — , 
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- one due to the approximation of ttj<^-qc{9\S{x'^)) by importance Monte 
Carlo techniques based on N simulations, which amounts to vsa-{a{6)\S {x'^)) / N;^, 
if A^acc is the expected number of acceptances. 

For the specific case when 5(x) = E[6'|x] — 6, the error satisfies 

E[L{e,e)\x"] = tTacc{AL) + h^ f x^AyiK{x)dyi + o{h^). 

which means that the first type error vanishes with small /I's, given that it is 
equivalent to the Bayes risk based on the whole data. From this decomposition 
of the risk, Fearnhead and Pranglc (2012) derive 

h - o(7v-i/(4+'^)) 

as an optimal bandwidth for the standard ABC algorithm. From a practical 
perspective, using the posterior expectation E[0|x°] as a summary statistic is 
obviously impossible, if only because even basic simulation from the posterior 
is impossible. Fearnhead and Prangle (2012) suggest using instead a two-stage 
procedure: 

1. Run a basic ABC algorithm to construct a non-parametric estimate of 
E[6'|xO] following Beaumont et al. (2002); and 

2. Use this non-parametric estimate as the summary statistic in a second 
ABC run. 

In cases when producing the reference sample is very costly, the same sample 
may be used in both runs, even though this may induce biases that will simply 
add up to the many approximative steps inherent to this procedure. 

In conclusion, the literature on the topic has gathered several techniques pro- 
posed for other methodologies. While this perspective manages to eliminate the 
less relevant components of a pool of statistics, I feel the issue remains quite 
open as to which statistic should be included at the start of an ABC algorithm. 
The problems linked with the curse of dimensionality ( "not too many" ) , identi- 
fiability ("not too few"), and ultimately precision ("as many as components of 
0" ) of the approximations are far from solved and I thus foresee further major 
developments to occur in the years to come. 

5.2. ABC model choice 

As stressed already above, model choice occupies a special place in the Bayesian 
paradigm and this for several reasons. First, the comparison of several models 
compels the Bayesian modeller to construct a meta-model that includes all these 
models under comparison as special cases. This encompassing model thus has a 
complexity that is higher than the complexities of the models under comparison. 
Second, while Bayesian inference on models is formally straightforward, in that 
it computes the posterior probabilities of the models under comparison — even 
though this raises misunderstanding and confusion in the non-Bayesian applied 
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communities, as illustrated by the series of controversies raised by Templeton 
(2008; 2010 — , the computation of such objects often faces major computational 
challenges. 

From an ABC perspective, the specificity of model selection holds as well. At 
first sight, and in sort of predictable replication of the theoretical setting, the 
formal simplicity of computing posterior probabilities can be mimicked by an 
ABC-MC (for model choice) algorithm as the following one (Toni and Stumpf, 
2010): 

Algorithm 4 ABC-MC 

for t = 1 to T do 
repeat 

Generate m* from the prior tt{M = m). 
Generate d* , from the prior 7r^«(-). 
Generate x* from the model /m* ('I'^J^* )• 
Compute the distance p(S(x*'), 5(x*)). 
Accept {e*^,,m*) if p(S(xO), 5(x*)) < e. 
until acceptance 
end for 

where AA denotes the unknown model index, m being one of the possible 
values, with 7r,„ the corresponding prior on the parameter 9m- 



As a consequence, the above 
algorithm process the pair (m, 
0,„) as a regular parameter, 
using the same tolerance con- 
dition ^(^(xO), S'(x*)) < eas 
the initial ABC algorithm. From 
the output of ABC-MC, the 
posterior probability 7r(A^ = 
?7i|y) can then be approximated 
by the frequency of acceptances 
of simulations from model m 




Fig 9. Box-plots of the repartition of the ABC poste- 
rior probabilities that a normal (Gauss) and double- 
exponential (Laplace) sample is from a normal (vs. 
double- exponential) distribution, based on 250 replica- 
tions and the median as summary statistic S (^Source: 
Marin et al., 2011a). 



n(M = m\y) = — 



T 

t=l 



l(*) = 



Improvements on this crude frequency estimate can be made using for instance 
a weighted polychotomous logistic regression estimate of t:{M. — m-jy), with 
non-parametric kernel weights, as in Cornuet et al. (2008). 

Example 1 (quinquies) If we resume our comparison of the normal and double- 
exponential models. Running ABC-MC in this case means 

1. picking normal to = 1 or double-exponential m — 2 with probability 1/2; 

2. simulating /.t^ ~ A/'(0, a^); 

3. simulating a normal A/'(/ii , 1) sample x* if tti = 1 and a double-exponential 
£(^2, Vv^) sample x* if to = 2; 
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4. compare S'(x'') and 5(x*) 

While the choice of S{-) is unhmited, some choices are relevant and others are 
to be avoided as discussed in Robert et al. (2011). Figures 9 and 10 show the 
difference in using for S the median of the sample (Figure 9) and the median 
absolute deviation (mad, defined as the median of the absolute values of the 
differences between the sample and its median, med(|a;i — emd(a;i)|)) statistics 
(Figure 10). In the former case, double exponential samples are not recognised 
as such and the posterior probabilities do not converge to zero. In the later case, 
they do, which means the ABC Bayes factor is consistent in this setting. -^ 

The conclusion of Robert "=i"» "='»»» "='»»»» 

ct al. (2011) is that the out- 
come of an ABC model choice 
based on a summary statis- 
tic that is insufficient may be 
untrustworthy and need to be 
checked by additional Monte 
Carlo experiments as those pro- 
posed in DIYABC (Cornuet 
et al., 2008). More recently, 
Marin et al. (2011a) exhibited 

conditions on the summary statistic for an ABC model choice approach to pro- 
vide a consistent solution. 



Fig 10. Same legend as Fig. 9 when the summary 
statistic S is the mad statistic ('Source; Marin ct al, 
2011a). 



6. Beyond 

This chapter provides a snapshot via a few illustrations of the diversity of 
Bayesian computational techniques. It also misses important directions, like 
the particle methods which are particularly suited for complex dynamical mod- 
els (Del Moral et al., 2006; Andrieu et al., 2011). Or variational Bayes tech- 
niques which rely on optimised approximations to a complex target distribu- 
tion (Jaakkola and Jordan, 2000). Or yet cruder approximations to the likeli- 
hood function based on higher order asymptotics (Ventura et al., 2009). Simi- 
larly, I did not mention recent simulations methodologies that coped with non- 
parametric Bayesian problems (Hjort et al., 2010) and with stochastic processes 
(Beskos ct al., 2006). The field is expanding and the demands made by the "Big 
Data" crisis are simultaneously threatening the fundamentals of the Bayesian 
approach by calling for quick-and-dirty solutions and bringing new materials, 
by exhibiting a crucial need for hierarchical Bayes modelling. Thus, to conclude 
with Dickens' (1859) opening words, we may later consider that "it was the best 
of times, it was the worst of times, it was the age of wisdom, it was the age of 
foolishness" . 
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